Introduction to scRNAseq analysis using scanpy

Workshop

Author
Affiliation
Irene Robles

CoSyne Therapeutics

Published

March 1, 2024

Goal

Introduction to scRNAseq analysis using scanpy

Who am I?

  • Irene Robles Rebollo
  • BSc in Biochemistry
  • MSc in Applied Biosciences and Biotechnology
  • PhD in Cell Biology
    • Half experimental
    • Half computational
  • Bioinformatician at Cosyne Therapeutics

History

Year Achievement
1953 DNA structure
1985 First PCR - DNA amplification
1988 PCR with Taq polymerase - Automated DNA amplification
1993 First qPCRs - RNA amplification
1995 Microarrays - Quantify an array gene expression using a chip
2001 First draft of the human genome
2003 First NGS DNA sequencer - High throughput DNA sequencing
2006 First RNA-seq - High throughput RNA sequencing
2009 First single-cell RNA-seq (Tang et al. 2009)
2013 Single-cell RNA-seq is declared method of the year

(Zhu et al. 2020; Gondane and Itkonen 2023)

Single-cell vs bulk RNAseq

Bulk RNA-seq, LyfeFuel from Unsplash

scRNA-seq, VD Photography from Unsplash

Feature Bulk data Single-cell data
Cell resolution Average of all cells Individual cell resolution
Sample representation Vector of gene expression values Matrix of gene expression values
Genomic resolution Higher, depends on sequencing depth Lower, depends on starting material
Cost Lower High
Computational requirements Lower Higher
Data size Lower Higher
Data interpretation Simple Complex

Sample representation

  • In bulk data, each sample is repressented by a vector, where each value is a gene.
  • In single cell data, each sample is a matrix, where each row is a gene and each column is a cell.

\[\begin{align} Bulk &= \begin{bmatrix} gene_{1} \\ gene_{2} \\ gene_{3}\\ \vdots \\ gene_{n} \end{bmatrix} \\ \\ Single-cell &= \begin{bmatrix} gene_1, cell_1 & gene_1, cell_2 & gene_1, cell_3 & \dots & gene_1, cell_m \\ gene_2, cell_1 & gene_2, cell_2 & gene_2, cell_3 & \dots & gene_2, cell_m \\ gene_3, cell_1 & gene_3, cell_2 & gene_3, cell_3 & \dots & gene_3, cell_m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ gene_n, cell_1 & gene_n, cell_2 & gene_n, cell_3 & \dots & gene_n, cell_m \end{bmatrix} \end{align}\]

Scanpy vs Seurat

  • Scanpy is a Python package for single-cell analysis (Wolf, Angerer, and Theis 2018)
  • Seurat is an R package for single-cell analysis (Satija et al. 2015)
  • Both are:
    • User-friendly tools for single-cell analysis
    • Open source
    • Well-documented
    • Widely-used
  • Choice depends on:
    • Language preference
    • Team expertise
    • Integration with downstream analysis
    • Speed and memory requirements

Hint: A good bioinformatician is not restricted by language. You can use R in Python can be done using the rpy2 package. And Python can be use within R using reticulate.

Scale of scRNAseq data

Number of cells per study over years (Svensson, Veiga Beltrame, and Pachter 2020)

AnnData object

AnnData object, source: scanpy web

Set up

  • Install Miniconda
  • Create a new environment
conda create -n myscanpy python=3.10
conda activate myscanpy
pip install -r requirements.txt
python -m ipykernel install --user --name myscanpy --display-name "Python (myscanpy)"

Go to Google Colab

Open a notebook from Github.

Introduce https://github.com/IreneRobles/scrnaseq_workshops and select workshop_notebook.ipynb

Import libraries

import scanpy as sc
import scrublet as scr
import numpy as np
import os
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns

Scanpy setttings

sc.settings.verbosity = 3   # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
scanpy==1.9.8 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.1 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11

Download data

mkdir data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE190nnn/GSE190622/suppl/GSE190622%5Fcount%5Fmatrix%5FAnnotated.csv.gz

Load tests datasets

Single-cell SMART-seq data from mouse macrophages (Robles-Rebollo et al. 2022)

mf = pd.read_csv("data/GSE190622_count_matrix_Annotated.csv.gz", index_col=0)
adata = sc.AnnData(X=mf.T)
adata.obs["genotype"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[0])
adata.obs["timepoint"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[1])
adata.obs["sample"] = adata.obs["genotype"] + "_" + adata.obs["timepoint"]
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 1362 × 18429
    obs: 'genotype', 'timepoint', 'sample'

Doublets

  • A doublet is a single-cell transcriptome that has been generated by two cells
  • They have often have a higher read and gene count
  • They can produce confounding results

Doublets, adapted from (Wolock, Lopez, and Klein 2019)

How do we fight doublets: - Experiment design - Computationally

Scrublet

Scrublet finds doublets based os simulations (Wolock, Lopez, and Klein 2019)

Scrublet algorithm overview (Wolock, Lopez, and Klein 2019)

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.41
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 25.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 0.3%
Elapsed time: 23.4 seconds
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Adjust threshold

predicted_doublets = scrub.call_doublets(threshold=0.2)
scrub.plot_histogram()
Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 50.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 1.4%
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Visualise doublets

scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True);

Filter out doublets

adata.obs["scrublet_doublet"] = predicted_doublets
adata = adata[adata.obs["scrublet_doublet"].apply(lambda x: x is False)]

Preprocessing

Highest expressing genes

Look for suspects: MALAT1, mitochondrial genes, ribosomal genes, componenets of the cytoskeleton, etc.

sc.pl.highest_expr_genes(adata, n_top=20)
normalizing counts per cell
    finished (0:00:00)

Quality metrics

sc.pp.calculate_qc_metrics computes quality control metrics for each cell. - Number of counts per cell - Number of genes per cell - Percentage of counts that come from mitochondrial genes.

# Mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, rotation = 90)

Different sampes might require different thresholds

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, groupby='sample', rotation = 90)

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color='sample')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='sample')

In this case, cells from different samples are reasonably homogeneous. However, that is not always the cells_per_paper.ipynb

Example case where different tissues might require different quality control threshols, fraction of Tabula Muris data (Schaum et al. 2018)

Filtering

Filter cells based on quality metrics

sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=5)
adata= adata[adata.obs.n_genes_by_counts < 5000, :]
adata= adata[adata.obs.pct_counts_mt < 15, :]
filtered out 1 genes that are detected in less than 5 cells

Normalisation: size normalisation and log transformation

For simplicity, we will use the simplest method: library size normalisation and log transformation, but there are others Hafemeister and Satija (2019).

Steps:

  • Transcript length normalisation:
    • Adjusts for differences in transcript length between genes
    • Divides the counts in each cell by the length of the transcript
    • Not necessary if you sequence a fixed region of the transcript ( 3’ or 5’ in 10X, our case)
  • Library size normalisation:
    • Adjusts for differences in sequencing depth between cells
    • Divides the counts in each cell by the total counts in that cell and multiplies by a scale factor (e.g. 10,000)
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
    finished (0:00:00)
  • Log transformation:
    • Logarithm of the normalised counts
    • Makes the data more normally distributed
sc.pp.log1p(adata)

Highly variable genes

  • Genes that show the most variation across cells
  • Often the most informative genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

Dimensionality reduction

Goal

  • Reducing the number of input variables or features in a dataset.

Why

  • Noise Reduction: Focus on the most variance-contributing features, thus enhancing the signal-to-noise ratio.
  • Data Visualization: We can plot in 2 or 3 dimensions
  • Aid clustering and clasification
  • Computational efficiency

Dimensionality reduction techniques

  • Principal Component Analysis (PCA): PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Popular dimensionality reduction that foccusses on the preservation on local structures
  • Uniform Manifold Approximation and Projection (UMAP): Popular dimensionality reduction that foccusses on the preservation on global structures

Prepare data for dimensionality reduction

The .raw attribute freezes the state of the AnnData object for later use. We can recumerate it by calling .raw.to_adata().

adata.raw = adata

Filter by highly variable genes

adata= adata[:, adata.var.highly_variable]
adata
View of AnnData object with n_obs × n_vars = 1277 × 3193
    obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'sample_colors', 'log1p', 'hvg'

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed.

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
regressing out ['total_counts', 'pct_counts_mt']
    finished (0:00:02)

Scale each gene to unit variance. Clip values exceeding standard deviation to 10.

sc.pp.scale(adata, max_value=10)

Principal component analysis (PCA)

PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.

sc.tl.pca(adata, svd_solver='arpack')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:42)
sc.pl.pca(adata, color='sample')

Varinace explained by each component:

sc.pl.pca_variance_ratio(adata, log=True)

UMAP and t-SNE

First, we need to compute the neighbourhood graph:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=20)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)

Compute UMAP:

sc.tl.umap(adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:01)

Compute t-SNE:

sc.tl.tsne(adata, n_pcs = 20)
computing tSNE
    using 'X_pca' with n_pcs = 20
    using sklearn.manifold.TSNE
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:00:03)

Plot UMAP:

sc.pl.umap(adata, color='sample')

Plot t-SNE:

sc.pl.tsne(adata, color='sample')

Clustering the neighborhood graph

Leiden algorithm. The Leiden algorithm starts from a singleton partition (a). The algorithm moves individual nodes from one community to another to find a partition (b), which is then refined (c). An aggregate network (d) is created based on the refined partition, using the non-refined partition to create an initial partition for the aggregate network. For example, the red community in (b) is refined into two subcommunities in (c), which after aggregation become two separate nodes in (d), both belonging to the same community. The algorithm then moves individual nodes in the aggregate network (e). In this case, refinement does not change the partition (f). These steps are repeated until no further improvements can be made. Figure 3 from (Traag, Waltman, and Van Eck 2019)

sc.tl.leiden(adata)
sc.pl.umap(adata, color=['sample','leiden'], use_raw=False)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

Finding marker genes

Genes differentially expressed in each cluster.

sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
0 1 2 3 4 5 6 7 8 9 10
0 Ccl5 Mapkapk2 Ctsd Sepp1 Ccnd1 Tnfsf9 Gm12155 Hmgb2 Hist1h2ap Ddit3 Tubb5
1 Il1rn Ebi3 Laptm5 Tnfsf9 Lgals3 Cxcl2 Gm13341 Tuba1b Hist3h2a Atf4 Cdca3
2 Isg15 Il1b Gm42418 Nfkbiz Ifi27l2a Tnf Gm42836 Tnf Trim17 Hspa5 Ube2c
3 Mndal Serp1 Lamp1 Cxcl1 Rnh1 Ccl4 Gm29216 Tubb5 H2afx Slc3a2 Birc5
4 Saa3 Ehd1 Sepp1 Serinc3 Lpl Id2 Mak Top2a Hist1h2aa Sqstm1 Plk1

Plot some markers:

sc.pl.umap(adata, color=['leiden', 'Ccl5', 'Mapkapk2', 'Ctsd', 'Sepp1', 'Ccnd1','Tnfsf9','Gm12155','Hmgb2','Hist1h2ap','Ddit3','Tubb5'], use_raw=True)

During the course of this analysis, the AnnData accumlated the following annotations.

adata
AnnData object with n_obs × n_vars = 1277 × 3193
    obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'leiden'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'sample_colors', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'tsne', 'leiden', 'leiden_colors', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Session info

!pip install session-info
import session_info
session_info.show()
Looking in indexes: https://pypi.org/simple, https://__token__:****@gitlab.com/api/v4/groups/54410195/-/packages/pypi/simple
Requirement already satisfied: session-info in /Users/ireneroblesrebollo/anaconda3/lib/python3.10/site-packages (1.0.0)
Requirement already satisfied: stdlib-list in /Users/ireneroblesrebollo/anaconda3/lib/python3.10/site-packages (from session-info) (0.8.0)
Click to view session information
-----
anndata             0.10.5.post1
matplotlib          3.8.3
numpy               1.26.4
pandas              2.2.1
polars              0.20.13
scanpy              1.9.8
scrublet            NA
seaborn             0.13.2
session_info        1.0.0
-----
Click to view modules imported as dependencies
PIL                         10.2.0
annoy                       NA
anyio                       NA
appnope                     0.1.4
arrow                       1.3.0
asttokens                   NA
attr                        23.2.0
attrs                       23.2.0
babel                       2.14.0
certifi                     2024.02.02
cffi                        1.16.0
charset_normalizer          3.3.2
comm                        0.2.1
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.9.0
debugpy                     1.8.1
decorator                   5.1.1
defusedxml                  0.7.1
exceptiongroup              1.2.0
executing                   2.0.1
fastjsonschema              NA
fqdn                        NA
h5py                        3.10.0
idna                        3.6
igraph                      0.11.4
ipykernel                   6.29.3
ipywidgets                  8.1.2
isoduration                 NA
jedi                        0.19.1
jinja2                      3.1.3
joblib                      1.3.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.9.0
jupyter_server              2.12.5
jupyterlab_server           2.25.3
kiwisolver                  1.4.5
lazy_loader                 NA
leidenalg                   0.10.2
llvmlite                    0.42.0
markupsafe                  2.1.5
matplotlib_inline           0.1.6
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
numba                       0.59.0
overrides                   NA
packaging                   23.2
parso                       0.8.3
patsy                       0.5.6
pexpect                     4.9.0
platformdirs                4.2.0
prometheus_client           NA
prompt_toolkit              3.0.43
psutil                      5.9.8
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     15.0.0
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pynndescent                 0.5.11
pyparsing                   3.1.1
pythonjsonlogger            NA
pytz                        2024.1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.12.0
send2trash                  NA
six                         1.16.0
skimage                     0.22.0
sklearn                     1.4.1.post1
sniffio                     1.3.1
stack_data                  0.6.3
statsmodels                 0.14.1
texttable                   1.7.0
threadpoolctl               3.3.0
tornado                     6.4
tqdm                        4.66.2
traitlets                   5.14.1
typing_extensions           NA
umap                        0.5.5
uri_template                NA
urllib3                     2.2.1
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
yaml                        6.0.1
zmq                         25.1.2
zoneinfo                    NA
-----
IPython             8.22.1
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.2
notebook            7.1.1
-----
Python 3.10.13 (main, Sep 11 2023, 08:16:02) [Clang 14.0.6 ]
macOS-13.3.1-arm64-arm-64bit
-----
Session information updated at 2024-03-01 17:02

Inspirations

References

Gondane, Aishwarya, and Harri M Itkonen. 2023. “Revealing the History and Mystery of RNA-Seq.” Current Issues in Molecular Biology 45 (3): 1860–74.
Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” Genome Biology 20 (1): 296.
Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. 2018. “A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” Nature Communications 9 (1): 284.
Robles-Rebollo, Irene, Sergi Cuartero, Adria Canellas-Socias, Sarah Wells, Mohammad M Karimi, Elisabetta Mereu, Alexandra G Chivu, et al. 2022. “Cohesin Couples Transcriptional Bursting Probabilities of Inducible Enhancers and Promoters.” Nature Communications 13 (1): 4342.
Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial Reconstruction of Single-Cell Gene Expression Data.” Nature Biotechnology 33 (5): 495–502.
Schaum, Nicholas, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, et al. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris: The Tabula Muris Consortium.” Nature 562 (7727): 367.
Svensson, Valentine, Eduardo da Veiga Beltrame, and Lior Pachter. 2020. “A Curated Database Reveals Trends in Single-Cell Transcriptomics.” Database 2020: baaa073.
Tang, Fuchou, Catalin Barbacioru, Yangzhou Wang, Ellen Nordman, Clarence Lee, Nanlan Xu, Xiaohui Wang, et al. 2009. “mRNA-Seq Whole-Transcriptome Analysis of a Single Cell.” Nature Methods 6 (5): 377–82.
Traag, Vincent A, Ludo Waltman, and Nees Jan Van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 5233.
Wolf, F Alexander, Philipp Angerer, and Fabian J Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19: 1–5.
Wolock, Samuel L, Romain Lopez, and Allon M Klein. 2019. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8 (4): 281–91.
Zhu, Hanliang, Haoqing Zhang, Ying Xu, Soňa Laššáková, Marie Korabečná, and Pavel Neužil. 2020. “PCR Past, Present and Future.” Biotechniques 69 (4): 317–25.